Conversation
Pull request overview
Adds benchmark/training convenience scripts and configuration tweaks for MI355X DeepSeek pretraining runs, plus enhanced training-log memory reporting and some workflow/build adjustments.
Changes:
- Add new launcher scripts for DeepSeek V3 and DeepSeek V2 Lite training, and a C4 data preparation helper.
- Extend Megatron training-log patching to report the rank with max ROCm memory usage.
- Update Slurm/docker launch behavior and Megatron prepare/build flags; adjust DeepSeek V3 MI355X config defaults; comment out CI unit-test jobs.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| start_training_dsv3.sh | New MI355X DeepSeek V3 Slurm launch script with env defaults and run flags. |
| start_training_dsv2_lite.sh | New MI355X DeepSeek V2 Lite Slurm launch script with env defaults and run flags. |
| prepare_c4_data.sh | New helper to merge pre-downloaded C4 shards and tokenize for Megatron. |
| primus/backends/megatron/patches/training_log/print_rank_last_patches.py | Add distributed aggregation to log the max ROCm memory usage rank. |
| examples/run_slurm_pretrain.sh | Allow overriding srun time/partition via env. |
| examples/run_pretrain.sh | Update AINIC NCCL IB TC/FIFO TC env settings. |
| examples/run_local_pretrain.sh | Prefer docker over podman and pull image before running. |
| examples/megatron/prepare.py | Install Emerging-Optimizers with --no-build-isolation. |
| examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml | Change PP default and add Turbo/Deepep-related overrides. |
| .github/workflows/ci.yaml | Comment out torch/jax unit-test jobs. |
examples/run_local_pretrain.sh (outdated)
```bash
docker_podman_proxy pull "$DOCKER_IMAGE"
```
Unconditionally pulling the image before running means this script will now fail in offline/air-gapped environments (or when using a locally-built image tag) even if the image is already present locally. Consider making the pull optional (e.g., behind a PULL_DOCKER_IMAGE flag) or using a runtime option like "--pull=missing" where supported.
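One possible shape for the gated pull, sketched below; `PULL_DOCKER_IMAGE` and the `maybe_pull_image` wrapper are illustrative names, and `docker_podman_proxy` is assumed to behave like the helper this script already defines:

```shell
#!/usr/bin/env bash
# Sketch: make the pre-run image pull opt-out and skip it when the image
# already exists locally. Flag name and defaults are assumptions.
PULL_DOCKER_IMAGE=${PULL_DOCKER_IMAGE:-1}

maybe_pull_image() {
    local image="$1"
    if [ "$PULL_DOCKER_IMAGE" != "1" ]; then
        echo "Skipping pull of $image (PULL_DOCKER_IMAGE=$PULL_DOCKER_IMAGE)"
        return 0
    fi
    # Only pull when the image is not already present locally.
    if docker_podman_proxy image inspect "$image" >/dev/null 2>&1; then
        echo "Image $image already present; skipping pull"
        return 0
    fi
    docker_podman_proxy pull "$image"
}
```

This keeps the current default behavior (pull on every run) while letting offline or locally-built-image users opt out with `PULL_DOCKER_IMAGE=0`.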
```bash
export NCCL_IB_HCA="ionic_0:1,ionic_2:1,ionic_3:1,ionic_4:1,ionic_5:1,ionic_7:1,ionic_8:1,ionic_9:1"
export GLOO_SOCKET_IFNAME=ens9np0
export NCCL_SOCKET_IFNAME=ens9np0
export CLEAN_DOCKER_CONTAINER=1
```
CLEAN_DOCKER_CONTAINER=1 will cause examples/run_local_pretrain.sh to remove all containers on the host (it runs "docker/podman ps -aq" and rm -f each). That’s a risky default for shared nodes; consider defaulting this to 0 and only enabling it explicitly when you’re sure it’s safe.
```diff
-export CLEAN_DOCKER_CONTAINER=1
+# Set to 1 to allow run_slurm_pretrain.sh to clean up all Docker/Podman containers on the host.
+# Use 1 only on dedicated/non-shared nodes where this is safe.
+export CLEAN_DOCKER_CONTAINER=0
```
start_training_dsv3.sh (outdated)
```bash
export PRIMUS_USER=tas
# export PRIMUS_TOKENIZED_DATA_PATH=/shared_aig/c4/tokenized/c4_en_train_text_document # this is the tokenized data path for the training
export PRIMUS_EXP_NAME=dsv3-pretrain-mbs_$MBS-gbs_$GBS-PP_$PRIMUS_PP-EP_$PRIMUS_EP-VPP_$PRIMUS_VPP-turbodeepep_$TURBO_DEEPEEP-legacygg_$LEGACY_GG-profile_$PROFILE
export PRIMUS_EXP_NAME=debugv3
```
PRIMUS_EXP_NAME is assigned twice; the second assignment overrides the descriptive name built from the run parameters, so logs/output will always go to the hard-coded "debugv3" directory. Remove the second export or gate it behind an explicit debug flag to avoid accidental output collisions.
```diff
-export PRIMUS_EXP_NAME=debugv3
+: "${PRIMUS_DEBUG_EXP_NAME:=0}"
+if [ "$PRIMUS_DEBUG_EXP_NAME" = "1" ]; then
+    export PRIMUS_EXP_NAME=debugv3
+fi
```
```bash
# By default downloads 1 shard (~350MB compressed, ~3M documents) for testing.
# Full C4-en has 1024 shards. Adjust --num_shards for more data.
###############################################################################

set -e

# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
TOKENIZER_TYPE="DeepSeekV3Tokenizer"
TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3"
WORKERS=${WORKERS:-$(nproc)}     # Number of preprocessing workers
HF_TOKEN=${HF_TOKEN:-"your_hf_token"}  # Set your HuggingFace token
```
The header comment says the script "downloads" C4 shards and that the default is 1 shard, but the implementation explicitly skips downloading and defaults NUM_SHARDS to 200. Please align the documentation and defaults with the actual behavior (either implement download, or update comments and set NUM_SHARDS default accordingly).
```bash
for i in $(seq 0 $((NUM_SHARDS - 1))); do
    SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
    SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
    echo "  [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
    zcat "${SHARD_PATH}" >> "${JSONL_FILE}"
done
```
This merge step appends directly into the final JSONL_FILE with '>>'. If the script is interrupted or a shard is corrupt, you'll end up with a partial JSONL that then causes future runs to skip merging because the file exists. Write to a temporary file and atomically move it into place on success (and/or validate the output) to avoid leaving a bad cached artifact.
```diff
-for i in $(seq 0 $((NUM_SHARDS - 1))); do
-    SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
-    SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
-    echo "  [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
-    zcat "${SHARD_PATH}" >> "${JSONL_FILE}"
-done
+# Write to a temporary file first to avoid leaving a corrupted final JSONL
+TMP_JSONL_FILE="$(mktemp "${JSONL_DIR}/c4_en_train.jsonl.tmp.XXXXXX")"
+# Ensure the temporary file is cleaned up on failure or interruption
+cleanup_tmp() {
+    if [ -n "${TMP_JSONL_FILE:-}" ] && [ -f "${TMP_JSONL_FILE}" ]; then
+        rm -f "${TMP_JSONL_FILE}"
+    fi
+}
+trap cleanup_tmp EXIT INT TERM
+for i in $(seq 0 $((NUM_SHARDS - 1))); do
+    SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
+    SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
+    echo "  [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
+    zcat "${SHARD_PATH}" >> "${TMP_JSONL_FILE}"
+done
+# Basic validation: ensure the merged file is non-empty before finalizing
+if [ ! -s "${TMP_JSONL_FILE}" ]; then
+    echo "ERROR: Merged JSONL is empty; aborting."
+    cleanup_tmp
+    exit 1
+fi
+# Move the completed temp file into place atomically
+mv "${TMP_JSONL_FILE}" "${JSONL_FILE}"
+# Prevent trap from deleting the now-final JSONL file
+TMP_JSONL_FILE=""
+trap - EXIT INT TERM
```
```bash
export NCCL_IB_HCA="ionic_0:1,ionic_2:1,ionic_3:1,ionic_4:1,ionic_5:1,ionic_7:1,ionic_8:1,ionic_9:1"
export GLOO_SOCKET_IFNAME=ens9np0
export NCCL_SOCKET_IFNAME=ens9np0
export CLEAN_DOCKER_CONTAINER=1
```
CLEAN_DOCKER_CONTAINER=1 will cause examples/run_local_pretrain.sh to remove all containers on the host (it runs "docker/podman ps -aq" and rm -f each). That’s a risky default for shared nodes; consider defaulting this to 0 and only enabling it explicitly when you’re sure it’s safe.
```diff
-export CLEAN_DOCKER_CONTAINER=1
+export CLEAN_DOCKER_CONTAINER=0  # Set to 1 to forcibly remove all Docker/Podman containers; use with caution on shared hosts
```
```python
r_ratio = r_used / r_total

# get the max rocm_mem_usage
usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.float32)
```
usage_tensor is created as float32 even though r_used is a byte count on the order of 1e11 for large GPUs. float32 will lose integer precision at that scale, which can make max-rank selection and reported GB values slightly inaccurate. Prefer using an integer dtype (e.g., int64) for the gathered byte counts, then format as GB on CPU.
```diff
-usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.float32)
+usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.int64)
```
```bash
export HF_TOKEN="your_hf_token"             # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key"   # make it your own wandb api key
```
This script exports placeholder values for HF_TOKEN and WANDB_API_KEY, which will overwrite any real credentials already set in the environment and can cause training/CI runs to fail unless the file is edited. Prefer reading from the environment and erroring if unset (or only setting defaults when variables are empty), rather than hard-coding placeholder strings.
```diff
-export HF_TOKEN="your_hf_token"             # make it your own hf token
-export WANDB_API_KEY="your_wandb_api_key"   # make it your own wandb api key
+# Require HF_TOKEN and WANDB_API_KEY to be provided via the environment.
+if [ -z "${HF_TOKEN:-}" ]; then
+    echo "Error: HF_TOKEN environment variable is not set. Please export HF_TOKEN before running this script." >&2
+    exit 1
+fi
+if [ -z "${WANDB_API_KEY:-}" ]; then
+    echo "Error: WANDB_API_KEY environment variable is not set. Please export WANDB_API_KEY before running this script." >&2
+    exit 1
+fi
+export HF_TOKEN
+export WANDB_API_KEY
```
```bash
NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
```
PRIMUS_PATH defaults to a user-specific absolute path (/shared/john/Primus), which makes this script non-portable and likely to fail for other users/environments. Consider requiring PRIMUS_PATH to be provided (and exit with a clear message if unset) or deriving it relative to the repo root.
```diff
-NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
-DATA_DIR=${DATA_DIR:-"/shared/c4"}
-PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
+SCRIPT_DIR="$(cd -- "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"
+NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
+DATA_DIR=${DATA_DIR:-"/shared/c4"}
+PRIMUS_PATH=${PRIMUS_PATH:-"${SCRIPT_DIR}/../Primus"}
+if [[ ! -d "$PRIMUS_PATH" ]]; then
+    echo "Error: PRIMUS_PATH is not set to a valid directory: '$PRIMUS_PATH'" >&2
+    echo "Please set PRIMUS_PATH explicitly, for example:" >&2
+    echo "    export PRIMUS_PATH=/path/to/Primus" >&2
+    exit 1
+fi
```
```yaml
# run-unittest-torch:
#   env:
#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
#   needs: [code-lint]
#   runs-on: [primus-lm-cicd-torch-j8knc]
#   steps:
#     - run: echo "🎉 Begin Primus-Turbo Checkout."
#     - name: Set commit hash to env
```
The unit test jobs for both torch and jax are commented out, which removes automated test execution from CI and can allow regressions (including in this PR) to merge unnoticed. If this is temporary, consider gating them behind a workflow input/label, reducing the test scope, or moving them to a separate workflow instead of disabling them entirely.
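If the disablement is meant to be temporary, one way to gate rather than delete is an `if:` condition on the job; the label name and `workflow_dispatch` input below are illustrative, not part of this repo:

```yaml
# Sketch: keep the job defined but skip it unless explicitly opted in.
# "run-unittests" (label) and "run_unittests" (input) are assumed names.
on:
  pull_request:
  workflow_dispatch:
    inputs:
      run_unittests:
        description: "Run torch/jax unit tests"
        type: boolean
        default: false

jobs:
  run-unittest-torch:
    if: >-
      contains(github.event.pull_request.labels.*.name, 'run-unittests') ||
      (github.event_name == 'workflow_dispatch' && inputs.run_unittests)
    needs: [code-lint]
    runs-on: [primus-lm-cicd-torch-j8knc]
    steps:
      - run: echo "Unit tests enabled for this run."
```

This keeps the job visible in the workflow (and trivially re-enabled) instead of relying on someone remembering to uncomment a 50-line block.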
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (2)
prepare_c4_data.sh:30
`HF_TOKEN` is defined here but is never used by this script (Step 1 explicitly skips download, and Step 2 calls preprocess_data.py without passing auth). Either remove this variable or use it in the download/auth flow to avoid implying it has an effect.
```bash
TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3"
WORKERS=${WORKERS:-$(nproc)}     # Number of preprocessing workers
HF_TOKEN=${HF_TOKEN:-"your_hf_token"}  # Set your HuggingFace token
```
prepare_c4_data.sh:139
- The summary message references `run_dsv3.sh`, but this repo's training entrypoints appear to be `start_training_dsv3.sh` / `examples/run_slurm_pretrain.sh`. Consider updating the wording to point to the actual script(s) users should run to avoid confusion.
```bash
echo "To use this data for training, set in run_dsv3.sh:"
echo ""
echo "  1. Change: --mock_data True → --mock_data False"
echo "  2. Add env: export PRIMUS_TOKENIZED_DATA_PATH=${TRAIN_OUTPUT_PREFIX}_text_document"
```
```bash
--time="${SLURM_TIME:-07:00:00}" \
--nodelist="${SLURM_NODELIST:-}" \
--partition="${SLURM_PARTITION:-amd-aig}" \
```
Passing --nodelist with an empty value (when SLURM_NODELIST is unset) causes srun to fail with an invalid nodelist. Consider only adding --nodelist when SLURM_NODELIST is non-empty (or default to omitting the flag entirely).
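A common bash pattern for this is to collect the srun arguments in an array and append `--nodelist` only when set; the `build_srun_args` wrapper below is illustrative, while the variable names follow the snippet above:

```shell
#!/usr/bin/env bash
# Sketch: only pass --nodelist when SLURM_NODELIST is non-empty,
# so srun never sees --nodelist="" (an invalid nodelist).
build_srun_args() {
    SRUN_ARGS=(--time="${SLURM_TIME:-07:00:00}"
               --partition="${SLURM_PARTITION:-amd-aig}")
    if [ -n "${SLURM_NODELIST:-}" ]; then
        SRUN_ARGS+=(--nodelist="${SLURM_NODELIST}")
    fi
}
# Usage: build_srun_args; srun "${SRUN_ARGS[@]}" ...
```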
```diff
 ARGS=("$@")

-VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
+VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")
```
This hard-codes an extra bind mount to /shared_aig/c4, which will break (or unintentionally create an empty host directory) on systems that don’t have that path. Consider making this conditional/configurable via an env var (e.g., EXTRA_VOLUME_ARGS / PRIMUS_EXTRA_MOUNTS) rather than always mounting a cluster-specific path.
```diff
-VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")
+VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
+# Optional extra volume mounts: set PRIMUS_EXTRA_MOUNTS to a string like:
+#   '-v /shared_aig/c4:/shared_aig/c4 -v /other/path:/other/path:ro'
+if [[ -n "${PRIMUS_EXTRA_MOUNTS:-}" ]]; then
+    # Intentional word splitting to allow multiple -v arguments.
+    VOLUME_ARGS+=(${PRIMUS_EXTRA_MOUNTS})
+elif [[ -d "/shared_aig/c4" ]]; then
+    # Backwards-compatible default: only mount /shared_aig/c4 if it exists.
+    VOLUME_ARGS+=(-v "/shared_aig/c4:/shared_aig/c4")
+fi
```
```bash
set -e

# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
```
The header comment says the default downloads 1 shard for testing, but NUM_SHARDS is set to default to 200. Please update either the comment or the default so they match (and avoid surprising users by downloading/processing 200 shards by default).
```diff
-NUM_SHARDS=${NUM_SHARDS:-200}    # Number of C4 shards to download (1-1024)
+NUM_SHARDS=${NUM_SHARDS:-1}      # Number of C4 shards to download (1-1024)
```
```python
ret = subprocess.run(
    ["pip", "install", "--no-build-isolation", "-e", str(emerging_optimizers_path)], check=True
)
if ret.returncode != 0:
    log_error_and_exit("Building Emerging Optimizers failed.")
```
subprocess.run(..., check=True) will raise an exception on failure, so the subsequent if ret.returncode != 0: branch is redundant/unreachable. Either remove the returncode check, or set check=False and keep the explicit returncode handling (and ideally invoke pip via sys.executable -m pip to ensure the active Python environment is used).
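To make the point concrete: with `check=True` a nonzero exit raises `subprocess.CalledProcessError` before the `returncode` test ever runs. A hedged rework keeping the explicit branch (the `install_editable` wrapper is illustrative; the error message mirrors the snippet above):

```python
import subprocess
import sys

def install_editable(path: str) -> None:
    """Install a local package in editable mode via the active interpreter.

    Sketch only: `sys.executable -m pip` ensures the active environment's
    pip runs, and check=False makes the explicit returncode branch reachable.
    """
    ret = subprocess.run(
        [sys.executable, "-m", "pip", "install", "--no-build-isolation", "-e", path],
        check=False,  # handle failure explicitly instead of raising
    )
    if ret.returncode != 0:
        raise SystemExit("Building Emerging Optimizers failed.")
```

The alternative is equally valid: keep `check=True`, drop the `if`, and wrap the call in `try/except subprocess.CalledProcessError`.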
```yaml
# run-unittest-torch:
#   env:
#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
#   needs: [code-lint]
#   runs-on: [primus-lm-cicd-torch-j8knc]
#   steps:
#     - run: echo "🎉 Begin Primus-Turbo Checkout."
#     - name: Set commit hash to env
#       run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
#     - name: Checkout Repo Primus-Turbo
#       uses: actions/checkout@v4
#       with:
#         repository: AMD-AIG-AIMA/Primus-Turbo
#         submodules: "recursive"
#         path: Primus-Turbo
#         ref: ${{ env.PRIMUS_TURBO_COMMIT }}
#     - run: echo "Begin Primus-Turbo Install."
#     - name: Install Primus-Turbo
#       run: |
#         mv Primus-Turbo /tmp/
#         echo "Primus-Turbo dir: /tmp/Primus-Turbo"
#         git config --global --add safe.directory /tmp/Primus-Turbo
#         cd /tmp/Primus-Turbo
#         start_time=$(date +%s)
#         echo "✅ [Pip install requirements] started at: $(date)"
#         mkdir -p ${PRIMUS_WORKDIR}/primus-cache
#         MAX_JOBS=128 pip install --cache-dir=${PRIMUS_WORKDIR}/primus-cache --no-build-isolation --no-clean -r requirements.txt
```
The Torch/JAX unit test jobs are entirely commented out, which effectively disables CI test execution for PRs and can allow regressions to merge undetected. If this is only needed temporarily, consider gating the jobs behind an if: condition (e.g., on a label/branch) rather than removing them from the workflow, or re-enable them before merging.
```diff
-# run-unittest-torch:
-#   env:
-#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
-#   needs: [code-lint]
-#   runs-on: [primus-lm-cicd-torch-j8knc]
-#   steps:
-#     - run: echo "🎉 Begin Primus-Turbo Checkout."
-#     - name: Set commit hash to env
-#       run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
-#     - name: Checkout Repo Primus-Turbo
-#       uses: actions/checkout@v4
-#       with:
-#         repository: AMD-AIG-AIMA/Primus-Turbo
-#         submodules: "recursive"
-#         path: Primus-Turbo
-#         ref: ${{ env.PRIMUS_TURBO_COMMIT }}
-#     - run: echo "Begin Primus-Turbo Install."
-#     - name: Install Primus-Turbo
-#       run: |
-#         mv Primus-Turbo /tmp/
-#         echo "Primus-Turbo dir: /tmp/Primus-Turbo"
-#         git config --global --add safe.directory /tmp/Primus-Turbo
-#         cd /tmp/Primus-Turbo
-#         start_time=$(date +%s)
-#         echo "✅ [Pip install requirements] started at: $(date)"
-#         mkdir -p ${PRIMUS_WORKDIR}/primus-cache
-#         MAX_JOBS=128 pip install --cache-dir=${PRIMUS_WORKDIR}/primus-cache --no-build-isolation --no-clean -r requirements.txt
+run-unittest-torch:
+  env:
+    PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
+  needs: [code-lint]
+  runs-on: [primus-lm-cicd-torch-j8knc]
+  steps:
+    - run: echo "🎉 Begin Primus-Turbo Checkout."
+    - name: Set commit hash to env
+      run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
+    - name: Checkout Repo Primus-Turbo
+      uses: actions/checkout@v4
+      with:
+        repository: AMD-AIG-AIMA/Primus-Turbo
+        submodules: "recursive"
+        path: Primus-Turbo
+        ref: ${{ env.PRIMUS_TURBO_COMMIT }}
+    - run: echo "Begin Primus-Turbo Install."
+    - name: Install Primus-Turbo
+      run: |
+        mv Primus-Turbo /tmp/
+        echo "Primus-Turbo dir: /tmp/Primus-Turbo"
+        git config --global --add safe.directory /tmp/Primus-Turbo
+        cd /tmp/Primus-Turbo
+        start_time=$(date +%s)
+        echo "✅ [Pip install requirements] started at: $(date)"
+        mkdir -p ${PRIMUS_WORKDIR}/primus-cache
+        MAX_JOBS=128 pip install --cache-dir=${PRIMUS_WORKDIR}/primus-cache --no-build-isolation --no-clean -r requirements.txt
```
```bash
export HF_TOKEN="your_hf_token"             # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key"   # make it your own wandb api key
```
This script exports HF_TOKEN and WANDB_API_KEY as literal strings, which both encourages storing secrets in a committed file and also overwrites any values already present in the caller’s environment. Prefer reading these from the environment (and erroring if missing) rather than exporting placeholder values here.
```diff
-export HF_TOKEN="your_hf_token"             # make it your own hf token
-export WANDB_API_KEY="your_wandb_api_key"   # make it your own wandb api key
+: "${HF_TOKEN:?Environment variable HF_TOKEN must be set}"
+: "${WANDB_API_KEY:?Environment variable WANDB_API_KEY must be set}"
+export HF_TOKEN
+export WANDB_API_KEY
```
```python
# get the max rocm_mem_usage
usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.float32)
world_size = torch.distributed.get_world_size()
gathered_usage = [torch.zeros_like(usage_tensor) for _ in range(world_size)]
torch.distributed.all_gather(gathered_usage, usage_tensor)

rocm_mem_usages = [t.item() for t in gathered_usage]
max_usage = max(rocm_mem_usages)
max_rank = rocm_mem_usages.index(max_usage)

rocm_mem_str = (
    f" | rocm mem usage/free/total/usage_ratio: "
    f"{r_used / 1024 ** 3:.2f}GB/"
    f"{r_free / 1024 ** 3:.2f}GB/"
    f"{r_total / 1024 ** 3:.2f}GB/"
    f"{r_ratio * 100:.2f}%"
    f" | rank-{max_rank} rocm max mem usage/usage_ratio: "
    f"{max_usage / 1024 ** 3:.2f}GB/"
    f"{max_usage / r_total * 100:.2f}%"
```
max_usage / r_total uses the local rank’s r_total when computing the max-usage ratio, which can be incorrect if ranks have different total HBM sizes (or if totals differ for any reason). To make this accurate, gather each rank’s r_total (or gather precomputed r_used/r_total ratios) and compute the max ratio corresponding to max_rank.
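The fix can be sketched without a live process group: gather `(used, total)` pairs per rank, then select the max by ratio. Below, a plain list stands in for the `all_gather` result, and the sample byte counts are made up for illustration:

```python
# Sketch: compute the max-usage rank from per-rank (used_bytes, total_bytes)
# pairs, as would be produced by all_gathering a 2-element int64 tensor.
gathered = [
    (150 * 1024**3, 192 * 1024**3),  # rank 0: 150 GB used of 192 GB
    (170 * 1024**3, 192 * 1024**3),  # rank 1: 170 GB used of 192 GB
    (120 * 1024**3, 128 * 1024**3),  # rank 2: smaller GPU, highest ratio
]

ratios = [used / total for used, total in gathered]
max_rank = max(range(len(ratios)), key=ratios.__getitem__)
max_used, max_total = gathered[max_rank]

print(max_rank, f"{ratios[max_rank] * 100:.2f}%")  # -> 2 93.75%
```

Note rank 2 wins despite using the fewest bytes, which is exactly the case the local-`r_total` division gets wrong.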
```diff
 # parallel
 tensor_model_parallel_size: ${PRIMUS_TP:1}
-pipeline_model_parallel_size: ${PRIMUS_PP:1}
+pipeline_model_parallel_size: ${PRIMUS_PP:8}
```
This config changes the default pipeline_model_parallel_size from the common default of 1 to 8, which is inconsistent with the MI355X DeepSeek-V3 FP8 config and most other configs. To avoid surprising users (and breaking single-node runs), consider keeping the default at 1 and relying on PRIMUS_PP to opt into PP>1.
```diff
-pipeline_model_parallel_size: ${PRIMUS_PP:8}
+pipeline_model_parallel_size: ${PRIMUS_PP:1}
```
No description provided.